Outliers are frequently found in data sets and can cause problems for researchers if not addressed. Failure to identify and deal with outliers in an appropriate manner may lead researchers to report erroneous results. Using a multiple regression context, this paper examines some of the reasons for the presence of outliers and simple methods for identifying them. Heuristic data sets and scatterplots provide illustrations of the concepts discussed. (Contains 2 figures, 2 tables, and 11...

This paper considers the use of commonality analysis as an effective tool for analyzing relationships between variables in multiple regression or canonical correlational analysis (CCA). The merits of commonality analysis are discussed and the procedure for running commonality analysis is summarized as a four-step process. A heuristic example is offered as a demonstration of the use of commonality analysis, and the potential limitations and advantages of commonality analysis are discussed. An...

Outliers are extreme data points that have the potential to influence statistical analyses. Outlier identification is important to researchers using regression analysis because outliers can influence the model used to such an extent that they seriously distort the conclusions drawn from the data. The effects of outliers on regression analysis are discussed, and examples of various detection methods are given. Most outlier detection methods involve the calculation of residuals. Given that the...
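
The residual-based idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the paper summarized above: the helper name `flag_outliers`, the cutoff of 2, and the data are all hypothetical.

```python
import math

def flag_outliers(x, y, cutoff=2.0):
    """Return indices of cases whose standardized residual exceeds `cutoff`.

    Hypothetical helper: each residual from a one-predictor least squares
    line is divided by the residual standard error sqrt(SSE / (n - 2)).
    The cutoff of 2 is an assumed rule of thumb, not a fixed standard.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [i for i, e in enumerate(residuals) if abs(e) / s > cutoff]

# Made-up data: y equals x except for one aberrant score at index 5.
x = list(range(10))
y = [float(v) for v in x]
y[5] = 15.0
print(flag_outliers(x, y))
```

Because the outlier itself inflates the residual standard error in the denominator, simple standardized residuals of this kind can understate how extreme a case is.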

One of the innovative approaches in the use of hierarchical linear models (HLM) is to use HLM for Slopes as Outcomes models. This implies that the researcher considers that the regression slopes vary from cluster to cluster randomly as well as systematically with certain covariates at the cluster level. Among the covariates, group indicator variables at the cluster level, which classify the cluster units into several groups, are often found to be significant predictors. If this is the case, the...

This paper presents an overview of logistic regression and illustrates the method with the data transformations that are conducted. It also discusses the interpretation of logistic regression results. To make the discussion more concrete, an analysis of a data set is presented in which logistic regression is used to predict the likelihood of a college student's withdrawing or failing a course. Logistic regression is a well-suited analysis technique when a dichotomous dependent variable is...
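
One data transformation behind logistic regression is the logit, which maps a probability to log-odds. The sketch below is a minimal illustration; the function names and the coefficient values are assumptions, not taken from the paper summarized above.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_probability(intercept, slope, x):
    """Probability of the outcome for predictor value x (one-predictor model)."""
    return inverse_logit(intercept + slope * x)

# Hypothetical coefficients for predicting course withdrawal or failure.
print(predicted_probability(intercept=-1.0, slope=0.5, x=2.0))
```

With these assumed coefficients the linear predictor is zero at x = 2, so the model assigns even odds to the outcome.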

Commonality analysis is a method of decomposing the R squared in a multiple regression analysis into the proportion of explained variance of the dependent variable associated with each independent variable uniquely and the proportion of explained variance associated with the common effects of one or more independent variables in various combinations. Unlike other variance partitioning methods (e.g., stepwise regression) that distort the results, commonality analysis considers all possible...
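
For the two-predictor case, the decomposition described above can be sketched directly from zero-order correlations. This is a minimal illustration, assuming standardized variables; the function names and correlation values are hypothetical.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R squared for y regressed on two predictors, from zero-order correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def commonality_two_predictors(r_y1, r_y2, r_12):
    """Split R squared into two unique components and one common component."""
    r2_full = r2_two_predictors(r_y1, r_y2, r_12)
    return {
        "unique_1": r2_full - r_y2 ** 2,            # gain from adding x1 to x2
        "unique_2": r2_full - r_y1 ** 2,            # gain from adding x2 to x1
        "common": r_y1 ** 2 + r_y2 ** 2 - r2_full,  # shared by x1 and x2
        "r2_full": r2_full,
    }

# Made-up correlations for illustration.
parts = commonality_two_predictors(r_y1=0.5, r_y2=0.4, r_12=0.3)
print(parts)
```

The three components always sum to the full R squared; with more predictors the number of components grows as 2^k - 1.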

All parametric statistical analyses have certain assumptions about the data that must be met reasonably to warrant the use of a given analysis. Distributional normality, for example, is a common assumption. There is a variety of ways that data in a distribution may detract from normality, but one common problem is the presence of outliers. Many applied regression researchers, however, are unfamiliar with the potential role and process of robust regression procedures. Robust regression methods...

Multiple regression analysis is used with considerable frequency by researchers as a means of predicting the impact of predictor variables on a dependent variable. Regression predictors are typically correlated, often intentionally. To better understand the relative contribution of each independent variable in regression (and other) analyses, researchers can partition the squared multiple correlation (R squared) into constituent portions that can be attributed to the independent variables both...

Although the concept of the general linear model (GLM) has existed since the 1960s, other univariate analyses such as the t-test and the analysis of variance models have remained popular. The GLM produces an equation that minimizes the mean differences of independent variables as they are related to a dependent variable. From a computer printout of a regression analysis, the researcher can obtain weights that apply to each variable and then construct this equation. Certain univariate analyses...

This paper discusses the importance of interpreting both regression coefficients and structure coefficients when analyzing the results of multiple regression analysis, particularly with correlated predictor variables. The concepts of multicollinearity and suppressor effects are introduced, along with examples from previously published articles that demonstrate how erroneous conclusions are drawn when researchers fail to consult both beta weights and structure coefficients (or both beta...
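
The contrast between beta weights and structure coefficients can be illustrated for two correlated predictors. The sketch below assumes standardized variables and uses made-up correlations; a structure coefficient is taken here as the correlation between a predictor and the predicted scores, which equals the predictor's correlation with the criterion divided by the multiple R.

```python
import math

def betas_two_predictors(r_y1, r_y2, r_12):
    """Standardized regression weights for two predictors."""
    denom = 1.0 - r_12 ** 2
    return (r_y1 - r_y2 * r_12) / denom, (r_y2 - r_y1 * r_12) / denom

def structure_coefficients(r_y1, r_y2, r_12):
    """Correlation of each predictor with the predicted scores (r / R)."""
    b1, b2 = betas_two_predictors(r_y1, r_y2, r_12)
    multiple_r = math.sqrt(b1 * r_y1 + b2 * r_y2)
    return r_y1 / multiple_r, r_y2 / multiple_r

# Hypothetical, highly correlated predictors.
print(betas_two_predictors(0.5, 0.4, 0.8))
print(structure_coefficients(0.5, 0.4, 0.8))
```

With these assumed values the second predictor receives a beta weight of about zero even though it correlates about .8 with the predicted scores, which is the kind of case in which consulting beta weights alone would mislead.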

Among the computer-based methods used for the construction of trees such as AID, THAID, CART, and FACT, the only one that uses an algorithm that first grows a tree and then prunes the tree is CART. The pruning component of CART is analogous in spirit to the backward elimination approach in regression analysis. This idea provides a tool for controlling tree size to some extent and thus for estimating the prediction error of the tree within a certain range of tree sizes. In the CART pruning...

Background: An extensive body of research has favored the use of regression over other parametric analyses that are based on OVA. When regression results are noteworthy, researchers tend to explore the magnitude of the beta weights for the respective predictors. Purpose: The purpose of this paper is to examine both beta weights and structure coefficients in interpreting regression results. Data Collection and Analysis: Two heuristic examples will be illustrated. Findings: When predictor variables...

All parametric analysis focuses on the "synthetic" variables created by applying weights to "observed" variables, but these synthetic variables are called by different names across methods. This paper explains four ways of computing the synthetic scores in factor analysis: (1) regression scores; (2) M. S. Bartlett's algorithm (1937); (3) the Anderson-Rubin method (T. W. Anderson and H. Rubin, 1956); and (4) standardized, noncentered factor scores. A description and...

A modification of the usual graphical representation of heterogeneous regressions is described that can aid in interpreting significant regions for linear or quadratic surfaces. The standard Johnson-Neyman graph is a bivariate plot with the criterion variable on the ordinate and the predictor variable on the abscissa. Regression surfaces are drawn for each group. If there are regions of significance, their boundaries are noted either on the graph or in the text. If there is a manageable number...

Homoscedasticity is an important assumption of linear regression. This paper explains what it is and why it is important to the researcher. Graphical and mathematical methods for testing the homoscedasticity assumption are demonstrated. Sources of heteroscedasticity and types of heteroscedasticity are discussed, and methods for correction are demonstrated. Graphs are used to illustrate different patterns that may be caused by heteroscedasticity. An extensive example for using Weighted Least Squares...
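
As a rough illustration of weighted least squares for a single predictor, the sketch below gives each case a weight that is assumed to be inversely proportional to its error variance. The function name, the weights, and the data are hypothetical and are not drawn from the paper's example.

```python
def weighted_least_squares(x, y, w):
    """Intercept and slope minimizing sum(w_i * (y_i - a - b * x_i) ** 2)."""
    total_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Assumption: the error standard deviation grows in proportion to x,
# so each case is weighted by 1 / x ** 2.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.2, 4.7, 7.5, 8.1]
w = [1.0 / xi ** 2 for xi in x]
print(weighted_least_squares(x, y, w))
```

Setting every weight to the same value reduces this to ordinary least squares, which is one quick way to check an implementation.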

Least squares methods are sophisticated mathematical curve fitting procedures used in all classical parametric methods. The linear least squares approximation is most often associated with finding the "line of best fit" or the regression line. Since all statistical analyses are correlational and all classical parametric methods are least squares procedures, it becomes imperative to understand just what the least squares procedure is and how it works. This paper illustrates the least...
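
For the one-predictor "line of best fit," the least squares criterion and its solution can be written out as follows. The notation (a for the intercept, b for the slope, S for the sum of squared errors) is an assumption made here for illustration.

```latex
\begin{aligned}
S(a, b) &= \sum_{i=1}^{n} \left( y_i - a - b\,x_i \right)^2 , \\
\frac{\partial S}{\partial a} = 0, \quad \frac{\partial S}{\partial b} = 0
\;&\Longrightarrow\;
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x} .
\end{aligned}
```

The fitted line therefore always passes through the point of means, and its slope is the covariance of x and y divided by the variance of x.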

Discrete Choice Marketing (DCM), a research technique that has become more popular in recent marketing research, is described. DCM is a method that forces people to look at the combination of relevant variables within each choice domain and, with each option fully defined in terms of the values for those variables, make a choice of options. DCM provides more reliable and valid results than do its simpler survey relatives because it more closely resembles the environment in which people...

Partial and part correlations are discussed as a means of statistical control. Partial and part correlation coefficients measure relationships between two variables while controlling for the influences of one or more other variables. They are statistical methods for determining whether a true correlation exists between a dependent and an independent variable while controlling for one or more other variables. This paper discusses the use and limitations of partial correlations, and presents...
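
For one control variable, both coefficients can be computed from the three zero-order correlations. The following is a minimal sketch with made-up values; the function and argument names are assumptions.

```python
import math

def partial_correlation(r_y1, r_y2, r_12):
    """Correlation of y and x1 with x2 removed from both."""
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))

def part_correlation(r_y1, r_y2, r_12):
    """Correlation of y and x1 with x2 removed from x1 only (semipartial)."""
    return (r_y1 - r_y2 * r_12) / math.sqrt(1 - r_12 ** 2)

# Hypothetical correlations among a criterion and two predictors.
print(partial_correlation(0.5, 0.4, 0.3))
print(part_correlation(0.5, 0.4, 0.3))
```

The squared part correlation equals the increase in R squared when x1 is added to a model that already contains x2, and the partial correlation is never smaller in absolute value than the part correlation.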

Researchers in education and the social sciences make extensive use of linear regression models in which the dependent variable is continuous-valued while the explanatory variables are a combination of continuous-valued regressors and dummy variables. The dummies partition the sample into groups, some of which may contain only a few observations. Such groups may easily include enough outliers to break down the parameter estimates. Models with many fixed or random effects appear to be especially...

This paper illustrates the concept of the general linear model (GLM) and explains how canonical correlation analysis is the GLM, using a heuristic data set to demonstrate how canonical correlation analysis subsumes various multivariate and univariate methods. The paper shows how each of these analyses produces a synthetic variable, like the Yhat variable in regression. Ultimately these synthetic variables are actually analyzed in all statistics, a fact that is important to researchers who want to...

This paper reviews issues involved in converting continuous variables to nominal variables to be used in the OVA techniques. The literature dealing with the dangers of dichotomizing continuous variables is reviewed. First, the assumptions invoked by OVA analyses are reviewed in addition to concerns regarding the loss of variance and a reduction in score reliability that result from the conversion of continuous variables to nominal variables. Second, regression is discussed as a more adequate...

The assumption that is most important to the hypothesis testing procedure of multiple linear regression is the assumption that the residuals are normally distributed, but this assumption is not always tenable given the realities of some data sets. When normal distribution of the residuals is not met, an alternative method can be initiated. As an alternative, data for one or more of the variables under study can be transformed in order to increase conformity to the required distributional...

Multiple regression is a useful statistical technique when the researcher is considering situations in which variables of interest are theorized to be multiply caused. It may also be useful in those situations in which the researcher is interested in studies of predictability of phenomena of interest. This paper provides an introduction to regression analysis, focusing on five major questions a novice user might ask. The presentation is set in the framework of the general linear model and...

The business of science is formulating generalizable insight. No one study, taken singly, establishes the basis for such insight. Meta-analysis, however, can be used to determine if results generalize and to estimate the mean and the variance of effect sizes across studies (J. Hunter and F. Schmidt, 1990). Meta-analysis inquiries treat studies (rather than people) as the units of analysis, and then use regression or other methods to determine the study features that explain or predict...
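
Estimating the mean and variance of effect sizes across studies can be sketched as below. The sample-size weighting is an assumption about the weighting scheme, and the function name and study values are hypothetical; no correction for sampling error or other artifacts is attempted.

```python
def weighted_mean_and_variance(studies):
    """Sample-size-weighted mean and variance of effect sizes.

    `studies` is a list of (sample_size, effect_size) pairs, so each study,
    rather than each person, is one unit of analysis.
    """
    total_n = sum(n for n, _ in studies)
    mean_es = sum(n * es for n, es in studies) / total_n
    var_es = sum(n * (es - mean_es) ** 2 for n, es in studies) / total_n
    return mean_es, var_es

# Two made-up studies reporting correlations as effect sizes.
print(weighted_mean_and_variance([(100, 0.2), (300, 0.4)]))
```

A large weighted variance suggests that results differ across studies, which is where regressing effect sizes on study features becomes useful.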

Presented at the Annual Meeting of the American Educational Research Association (AERA) in April 2009. Compares results of different approaches to propensity-score matching with hierarchical data.

One of the major problems that a tree-approach to data analysis often encounters is the instability of tree-structures. The instability issue must be dealt with before data can be interpreted by this method. Examining instability at a node of a tree provides insight into the instability of the whole tree, because the same theory of instability applies to all the nodes. This paper deals with the instability issue at a single node of a tree. It is assumed that the data are from a regression...

The interactive effects of Air Force occupational specialty and personnel characteristics on predictions of tenure for first-term enlisted airmen were studied. Historical data files were compiled on 280,039 Air Force enlistees. Two classes of variables were extracted for the sample: personnel characteristics, including age, sex, race, educational background, and aptitude scores; and occupational assignments identifying the enlistee's Air Force specialty code (AFSC). Two tenure criteria were...

This paper gives concise descriptions of a robust location statistic, the remedian of P. Rousseeuw and G. Bassett (1990), and a robust measure of dispersion, the "Sn" of P. Rousseeuw and C. Croux (1993). The use of Sn in least absolute errors regression (L1) is discussed, and BASIC programs for both statistics are provided. The remedian is an iterated median that needs a small amount of memory to process large data sets. It is an attractive robust estimator of location in large samples...
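
The iterated-median idea can be sketched in Python rather than BASIC. This is a simplified illustration, not the paper's program: it assumes an odd buffer size and a sample size that is an exact power of that size, and the function name `remedian` and its `base` parameter are choices made here.

```python
from statistics import median

def remedian(data, base=3):
    """Iterated median over buffers of size `base` (assumed odd, at least 3).

    Sketch only: it requires len(data) == base ** k for some integer k >= 0
    and raises ValueError otherwise; other sample sizes are not handled.
    """
    if base < 3 or base % 2 == 0:
        raise ValueError("base must be an odd integer of at least 3")
    buffers = []  # buffers[level] holds at most `base` values
    for value in data:
        level = 0
        while True:
            if level == len(buffers):
                buffers.append([])
            buffers[level].append(value)
            if len(buffers[level]) < base:
                break
            # Buffer is full: pass its median up one level and reuse the buffer.
            value = median(buffers[level])
            buffers[level].clear()
            level += 1
    if not buffers or len(buffers[-1]) != 1 or any(buffers[:-1]):
        raise ValueError("sample size must be a power of the base")
    return buffers[-1][0]

print(remedian([9, 1, 5, 2, 8, 3, 7, 4, 6], base=3))
```

Only about `base` values per level are held at any time, which is why the approach needs little memory even when the data arrive as a long stream.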

Many researchers are unfamiliar with suppressor variables and how they operate in multiple regression analyses. This paper describes the role suppressor variables play in a multiple regression model and provides practical examples that explain how they can change research results. A variable that, when added as another predictor, increases the total correlation coefficient squared (R squared) is a suppressor variable. Suppressor variables measure invalid variance in the predictor measures and...
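
A small worked example, using made-up correlations, shows how a predictor that is uncorrelated with the criterion can still raise R squared. Here x2 plays the suppressor role: it correlates zero with y but .5 with x1.

```latex
R^2_{y \cdot 12}
  = \frac{r_{y1}^2 + r_{y2}^2 - 2\, r_{y1}\, r_{y2}\, r_{12}}{1 - r_{12}^2}
  = \frac{(.5)^2 + 0^2 - 2(.5)(0)(.5)}{1 - (.5)^2}
  = \frac{.25}{.75}
  \approx .333
  \;>\; r_{y1}^2 = .25 .
```

Under these assumed values, adding x2 lifts R squared from .25 to about .333 even though x2 alone predicts nothing, because it removes irrelevant variance from x1.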

Linear regression examines the relationship between one or more independent (predictor) variables and a dependent variable. By using a particular formula, regression determines the weights needed to minimize the error term for a given set of predictors. With one predictor variable, the relationship between the predictor and the dependent variable is linear. With two predictors, this relationship becomes planar, and with three or more predictors, this relationship becomes hyperplanar. By...

Missing data occur in virtually every study. This paper reviews some of the various strategies for addressing this problem. The paper also provides instructional detail on two accessible ways of estimating missing data, both using the Statistical Package for the Social Sciences for Windows: (1) substitution of missing values with the variable mean of nonmissing scores; and (2) replacement of missing values with estimates derived from regression. Nine tables and five appendixes provide details...

The information that is gained through various analyses of the residual scores yielded by the least squares regression model is explored. In fact, the most widely used methods for detecting data that do not fit this model are based on an analysis of residual scores. First, graphical methods of residual analysis are discussed, followed by a review of several quantitative approaches. Only the more widely used approaches are discussed. Example data sets are analyzed through the use of the...

Using a hypothetical data set of 24 cases concerning opinions on contemporary issues on which Democrats and Republicans might disagree, concrete examples are provided to illustrate that canonical correlation analysis is the most general linear model, subsuming other parametric procedures as special cases. Specific statistical techniques included in the analysis are "t"-tests, Pearson correlation, multiple regression, analysis of variance, multiple analysis of variance, and...

A regression procedure is developed to link simultaneously a very large number of item response theory (IRT) parameter estimates obtained from a large number of test forms, where each form has been separately calibrated and where forms can be linked on a pairwise basis by means of common items. An application is made to forms in which a two-parameter logistic model is applied to dichotomous items and a general partial credit model is applied to polytomous items.

This paper identifies specific problems with stepwise regression, notes criticisms of stepwise methods by statisticians, suggests appropriate ways in which stepwise procedures can be used, and gives examples of how this can be done. Although the stepwise method has been routinely criticized by statisticians, it is still frequently used in the literature. This paper suggests research situations when stepwise regression may have a valuable function. Stepwise methods can be appropriate for...

Although analysis of covariance (ANCOVA) is used fairly infrequently in published research, the method is used much more frequently in dissertations and in evaluation research. This paper reviews the assumptions that must be met for ANCOVA to yield useful results, and argues that ANCOVA will yield distorted and inaccurate results when these assumptions are violated. For ANCOVA to provide meaningful statistical control and to not obscure or mislead, it must be ascertained that the data set...

Many missing data studies have simulated data, randomly deleted values, and investigated the method of handling the missing values that would most closely approximate the original data. Regression procedures have emerged as the most recommended methods. If the values are missing randomly, these procedures are effective. If, however, the values are not missing randomly, the use of regression procedures to impute values for missing data is questionable. The purpose of this study was to determine...

An estimation tool for symmetric univariate nonlinear regression is presented. The method is based on introducing a nontrivial set of affine coordinates for diffeomorphisms of the real line. The main ingredient making the computations possible is the Connes-Moscovici Hopf algebra of these affine coordinates.

The paper stresses the importance of consulting beta weights and structure coefficients in the interpretation of regression results. The effects of multicollinearity and suppressors on the interpretation of beta weights are discussed. It is concluded that interpretations based on beta weights only can lead the unwary researcher to inaccurate conclusions. Despite warnings, though, researchers are still using only beta weights in the interpretation of regression analyses. A review of...

This paper reviews the literature on methods for dealing with missing data, discusses four commonly used methods, and illustrates these approaches with a small hypothetical data set. Most studies contain some missing data, and the reasons data are missing are many and varied. Four commonly used methods have been identified in the literature: (1) listwise deletion; (2) pairwise deletion; (3) mean imputation; and (4) regression imputation. Listwise deletion, which is the default in some...
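
Two of the four methods listed, listwise deletion and mean imputation, are simple enough to sketch directly. The example below is a minimal illustration that marks missing values with `None`; the function names and the data are hypothetical.

```python
def listwise_delete(rows):
    """Keep only the cases with no missing value on any variable."""
    return [row for row in rows if None not in row]

def mean_impute(rows):
    """Replace each missing value with the mean of that variable's observed scores."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        tuple(means[j] if value is None else value for j, value in enumerate(row))
        for row in rows
    ]

# Small made-up data set: three cases measured on two variables.
data = [(1.0, 2.0), (None, 4.0), (3.0, None)]
print(listwise_delete(data))
print(mean_impute(data))
```

Listwise deletion shrinks the sample, while mean imputation keeps every case but reduces the variance of the imputed variable, so the two approaches can lead to different regression results.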

This presentation discusses the use of a time series approach to the analysis of daily attendance in two urban high schools over the course of one school year (2009-10). After establishing that the series for both schools were stationary, they were examined for moving average processes, autoregression, seasonal dependencies (weekly cycles), outliers and heteroscedasticity. Seasonal dependencies were significant in both schools. In addition, contrary to what the traditional attendance statistics...

The effect of a nonlinear regression term on the behavior of the standard analysis of covariance (ANCOVA) F test was investigated for balanced and randomized designs through a Monte Carlo study. The results indicate that the use of the standard analysis of covariance model when a quadratic term is present has little effect on Type I error rates but produces a substantial power loss compared to theoretically expected values, often in excess of 20%. The extent of the power loss depends on the...

When survey data are statistically analyzed, some of the data are often missing. If the missing values are not correctly handled, results of the analysis may be dubious and publication may jeopardize the credibility of the organization preparing the report. This study examined four of the more commonly used methods of handling missing data. The following techniques were compared: (1) listwise deletion; (2) pairwise deletion; (3) mean substitution; and (4) regression imputation of missing...

The use of repeated measures research designs is explored. Repeated measures designs are often advantageous and can be implemented in a variety of research settings. One of the main advantages in repeated measures designs is the control of subject variability. Other advantages are the reduction of error variance and economy in subject recruitment. A disadvantage is the carry-over, or practice, effect of the repetition. A heuristic example is presented to illustrate the different statistical and...

Methods of regression commonality analysis are generalized for use in canonical correlation analysis. An actual data set (involving educators' attitudes toward death and age, locus of control, religion, and occupational role in working with terminally ill children) is employed to illustrate the extension. The method can be applied with respect to each canonical function in an analysis to determine the proportion of explanatory power of a variable set which is unique, as well as the proportion...

The use of stepwise methodologies has been sharply criticized by several researchers, yet their popularity, especially in educational and psychological research, continues unabated. Stepwise methods have been considered particularly well suited for use in regression and discriminant analyses, but their use in discriminant analysis (predictive discriminant analysis and descriptive discriminant analysis) has not been the direct focus of as much written commentary. This discussion considers...

An alternative is proposed for the Johnson-Neyman procedure (P. O. Johnson and J. Neyman, 1936). Used when heterogeneous regression lines for two groups are analyzed, the Johnson-Neyman procedure is a technique in which the difference between the two linear regression surfaces for the criterion variate (Y) is estimated conditional on a realization of the predictor variate (X). The motivation of the alternate procedure is to estimate the point on the X variate at which two heterogeneous...

This study examined the reliability of three methods for detecting differential item functioning (DIF) (i.e., the Mantel-Haenszel method, the standardization method, and the logistic regression method) applied to achievement test data. In addition, the study examined the influences of different sources of error variance, including examinee, occasion, and curriculum sampling on the magnitude of the reliability of the different DIF detection methods. Three datasets were assembled from the 1992...

Logistic regression was used to develop appropriate weights for an academic admission index. A combined sample of 3-year freshman cohorts (fall 1996 through fall 1998) was used to develop the index. The weights in several logistic regression analyses for high school class percentile and ACT composite score predicting different college outcomes were taken into consideration to compose a simplified academic admission index. The effectiveness of the index was examined by several outcome measures...

The problem with "classical" statistics all invoking the mean is that these estimates are notoriously influenced by atypical scores (outliers), partly because the mean itself is differentially influenced by outliers. In theory, "modern" statistics may generate more replicable characterizations of data, because at least in some respects the influence of more extreme scores, which are less likely to be drawn in future samples from the tails of a non-uniform (non-rectangular or...

